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Abstract 

A wide class of regularization problems in machine learning and statistics employ a reg- 
ularization term which is obtained by composing a simple convex function with a linear 
transformation. This setting includes Group Lasso methods, the Fused Lasso and other total 
variation methods, multi-task learning methods and many more. In this paper, we present a 
general approach for computing the proximity operator of this class of regularizers, under the 
assumption that the proximity operator of the function co is known in advance. Our approach 
builds on a recent line of research on optimal first order optimization methods and uses fixed 
point iterations for numerically computing the proximity operator. It is more general than cur- 
rent approaches and, as we show with numerical simulations, computationally more efficient 
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than available first order methods which do not achieve the optimal rate. In particular, our 
method outperforms state of the art 0(y ) methods for overlapping Group Lasso and matches 
optimal 0{jrj) methods for the Fused Lasso and tree structured Group Lasso. 

1 Introduction 

In this paper, we study supervised learning methods which are based on the optimization problem 

mmf{x)+g{x) (1.1) 

where the function / measures the fit of a vector x to available training data and g is a penalty 
term or regularizer which encourages certain types of solutions. More precisely we let f{x) = 
E(y, Ax), where ii^ : x — )• [0, oo) is an error function, y G is vector of measurements 
and A G a matrix, whose rows are the input vectors. This class of regularization methods 
arise in machine learning, signal processing and statistics and have a wide range of applications. 

Different choices of the error function and the penalty function correspond to specific meth- 
ods. In this paper, we are interested in solving problem (|l.ll) when / is a strongly smooth convex 
function (such as the square error E(y, Ax) = \\y — Ax\\l) and the penalty function g is obtained 
as the composition of a "simple" function with a linear transformation B, that is, 

g{x)=u{Bx) (1.2) 

where i? is a prescribed m x d matrix and is a nondijferentiable convex function on W^. The 
class of regularizers (|1.21 i includes a plethora of methods, depending on the choice of the function 
uj and of matrix B. Our motivation for studying this class of penalty functions arises from sparsity- 
inducing regularization methods which consider u to be either the £i norm or a mixed norm. 
When B is the identity matrix and p = 2, the latter case corresponds to the well-known Group 
Lasso method [|36l , for which well studied optimization techniques are available. Other choices 
of the matrix B give rise to different kinds of Group Lasso with overlapping groups [|T2l l38l . 
which have proved to be effective in modeling structured sparse regression problems. Further 
examples can be obtained considering composition with the ii norm (e.g. this includes the Fused 
Lasso penalty function [32] and other total variation methods [21]) as well as composition with 
orthogonally invariant norms, which are relevant, for example, in the context of multi-task learning 

m. 

A common approach to solve many optimization problems of the general form (11.11 ) is via 
proximal methods. These are first-order iterative methods, whose computational cost per iteration 
is comparable to gradient descent. In some problems in which g has a simple enough form, they 
can be combined with acceleration techniques [(3l |26l |28l |33l |34l> to yield significant gains in 
the number of iterations required to reach a certain approximation accuracy of the minimal value. 
The essential step of proximal methods requires the computation of the proximity operator of 
function g (see Definition 12 . 1 I below) . In certain cases of practical importance, this operator admits 
a closed form, which makes proximal methods appealing to use. However, in the general case 
(11.2! ) the proximity operator may not be easily computable. We are aware of techniques to compute 
this operator for only some specific choices of the function cu and the matrix B. Most related to 
our work are recent papers for Group Lasso with overlap [17] and Fused Lasso [[191 . See also 
[[B 121 [Ml [20l |24l for other optimization methods for structured sparsity. 
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The main contribution of this paper is a general technique to compute the proximity operator of 
the composite regularizer (11.21 ) from the solution of a certain fixed point problem, which depends 
on the proximity operator of the function to and the matrix B. This fixed point problem can be 
solved by a simple and efficient iterative scheme when the proximity operator of to has a closed 
form or can be computed in a finite number of steps. When / is a strongly smooth function, 
the above result can be used together with Nesterov's accelerated method [|26l l28l to provide an 
efficient first-order method for solving the optimization problem (11.11) . Thus, our technique allows 
for the application of proximal methods on a much wider class of optimization problems than is 
currently possible. Our technique is both more general than current approaches and also, as we 
argue with numerical simulations, computationally efficient. In particular, we will demonstrate that 
our method outperforms state of the art (9(^) methods for overlapping Group Lasso and matches 
optimal O(^) methods for the Fused Lasso and tree structured Group Lasso. 

The paper is organized as follows. In Section |2l we review the notion of proximity operator 
and useful facts from fixed point theory. In Section |3l we discuss some examples of composite 
functions of the form (11.21 ) which are valuable in applications. In Section |4l we present our tech- 
nique to compute the proximity operator for a composite regularizer of the form (II. 2t and then an 
algorithm to solve the associated optimization problem (11.11) . In Section[5l we report our numerical 
experience with this method. 

2 Background 

We denote by (-, ■) the Euclidean inner product on and let || ■ II2 be the induced norm. If 
f : M — > M, for every x E we denote by v(x) the vector (f (xj) : i E N^), where, for every 
integer d, we use Nd as a shorthand for the set {1, . . . , d}. For every p > 1, we define the £p norm 



The proximity operator on a Hilbert space was introduced by Moreau in [|22ll23l . 

Definition 2.1. Let u be a real valued convex function on W^. The proximity operator of uj is 
defined, for every x eW^ hy 



The proximity operator is well defined, because the above minimum exists and is unique. 
Recall that the subdifferential of a convex function w at a; is defined as 



The subdifferential is a nonempty compact and convex set. Moreover, if uj is differentiable at x 
then its subdifferential at x consists only of the gradient of bj at x. The next proposition establishes 
a relationship between the proximity operator and the subdifferential of a; - see, for example, [[2T1 
Prop. 2.6] for a proof. 

Proposition 2.1. If u is a convex function on M*^ and y eW^ then 



of a; as ||x||p = i^.^ 





(2.1) 



du[x) = {u : M e M'^, (y - X, u) + uj{x) < co{y), y E R'^}. 



X E duj{y) if and only if y = prox^(x + y) . 
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We proceed to discuss some examples of functions u and the corresponding proximity opera- 
tors. 

If ijj{x) = A||x||^, where A is a positive parameter, we have that 

prox^(x) = h^^ {\x\)sign{x) (2.2) 

where the function : [0, oo) — )■ [0, oo) is defined, for every t > 0, as h{t) = Xpt^^^ + t. This fact 
follows immediately from the optimality condition of the optimization problem (12.11) . Using the 
above equation, we may also compute the proximity map of a multiple of the ip norm, namely the 
case that w = 7|| ■ where 7 > 0. Indeed, for every x G W'-, there exists a value of A, depending 
only on 7 and x, such that the optimization problem (12.11) for w = 7II ■ ||p equals to the solution of 
the same problem for w = A|| ■ ||^. Hence the proximity map of the £p norm can be computed by 
(|2.2I) together with a simple line search. The cases that p G {1, 2} are simpler, see e.g. [7J. For 
p = 1 we obtain the well-known soft-thresholding operator, namely 

P™^A|| ||i = - A)+sign(x), (2.3) 
where, for every t G M, we define (t)+ = t if t > and zero otherwise; when p = 2 we have that 

( ^ / (11^112 -A) + ^ ifa;^0 
P--AIMI.(^) = I " " ifx = 0. ^^-^^ 

In our last example, we consider the ioo norm, which is defined, for every x G M'^ as ||a;||oo = 
max{|xj| : i G N^}. We have that 



^ |a;,|>Sfc 



P^o^A|| |u(a;) = min (\x\,- - A > sign(x 



where Sk is the fc-th largest value of the components of the vector |x| and k is the largest integer 
such that Yj\xi\>sS\^\^ ~ '^'^^ ^ ^' ^ proof of the above formula, see, for example [@ Sec. 5.4]. 

Finally, we recall some basic facts about fixed point theory which are useful for our study. For 
more information on the material presented here, we refer the reader to [37l. 

A mapping : M°' — )■ M'^ is called strictly non-expansive (or contractive) if there exists (3 G 
[0, 1) such that, for every X, ?/ G M'^, \\(p{x) — ip{y)\\2 < l3\\x — y\\2. If the above inequality holds for 
(3 = 1, the mapping is called nonexpansive. As noted in [|7l Lemma 2.4], both prox^ and / — prox^^ 
are nonexpansive. 

We say that x is a fixed point of a mapping if x = v'(x). The Picard iterates x",n G N, 
starting at xq G are defined by the recursive equation x" = (p{x"~^). It is a well-known fact 
that, if is strictly nonexpansive then has a unique fixed point x and lim„_j,oo x" = x. However, 
this result fails if (p is nonexpansive. We end this section by stating the main tool which we use to 
find a fixed point of a nonexpansive mapping (p. 

Theorem 2.1. (Opial n-average theorem / iJQl/ ) Let : ^ be a nonexpansive mapping, 
which has at least one fixed point and let := + (1 — K)ip. Then, for every k G (0, 1), the 
Picard iterates ofip^ converge to a fixed point ofip. 
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3 Examples of Composite Functions 



In this section, we show that several examples of penalty functions which have appeared in the 
literature fall within the class of linear composite functions (11.21) . 

We define for every d E N, x E and J C N^, the restriction of the vector x to the index set 
J as x\j = (xi : i E J). Our first example considers the Group Lasso penalty function, which is 
defined as 

uJchix) = ^ ||a;|jj|2 (3.1) 

where Ji are prescribed subsets of (also called the "groups") such that U^^^J^ = Nd- The 
standard Group Lasso penalty (see e.g. [36J) corresponds to the case that the collection of groups 
{Ji : i E Nk} forms a partition of the index set Nd, that is, the groups do not overlap. In this case, 
the optimization problem (12.11 ) for tu = ugl decomposes as the sum of separate problems and the 
proximity operator is readily obtained by applying the formula (12.41) to each group separately. In 
many cases of interest, however, the groups overlap and the proximity operator cannot be easily 
computed. 

Note that the function (|3.1I) is of the form (11.21) . We let di = \Ji\, m = ^^gp^^ di and define, 
for every z E M", uj{z) = JZeeN^ W^^h^ where, for every £ G we let ze = {zi : X)jGN^_i '^i < 
i < XljeN^ ^i)- Moreover, we choose B = [BJ , . . . , B^Y, where Be is a di x d matrix defined as 



1 ifj = M^ 
otherwise 



where for every J CNd and i E N| j|, we denote by J[i] the i-th largest integer in J. 

The second example concerns the Fused Lasso [|32l . which considers the penalty function 
X I— i- g{x) = J2ieNa 1 ~ It immediately follows that this function falls into the class 

(11.21) if we choose tu to be the ii norm and B the first order divided difference matrix 



B 



1 -1 
1 -1 



(3.2) 



The intuition behind the Fused Lasso is that it favors vectors which do not vary much across 
contiguous components. Further extensions of this case may be obtained by choosing B to be the 
incidence matrix of a graph, a setting which is relevant for example in online learning over graphs 
[[TT| . Other related examples include the anisotropic total variation, see for example, f2V\ . 

The next example considers composition with orthogonally invariant (01) norms. Specifically, 
we choose a symmetric gauge function h, that is, a norm h, which is both absolute and invariant 
under permutations Il35l and define the function cu : M*^^" — )■ [0, oo), at X by the formula 

uj{X) = h{a{X)) 

where cr(X) E [0, oo)^, r = mm{d, n) is the vector formed by the singular values of matrix X, 
in non-increasing order. An example of Ol-norm are Schatten p-norms, which correspond to the 
case that a; is the £p-norm. The next proposition provides a formula for the proximity operator 
of an Ol-norm. The proof is based on an inequality by von Neumann [|35l . sometimes called von 
Neumann's trace theorem or Ky Fan's inequality. 
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Proposition 3.1. With the above notation, it holds that 

prox,„,(X) = f/diag (prox,(a(X))) 

where X = Udiag{a{X))V^ and U and V are the matrices formed by the left and right singular 
vectors of X, respectively. 

Proof. The proof is based on an inequality by von Neumann [35], sometimes called von Neu- 
mann's trace theorem or Ky Fan's inequality. It states that (X, Y) < (cr(X), cr{Y)), with equality 
if and only if X and Y share the same ordered system of singular vectors. Note that 

\\X-Y\\l = \\X\\l + \\Y\\l-2{X,Y) 

> \\a{X)\\l + \\a{Y)\\l-2{a{X),a{Y)) 
= MX) - aiY)\\l 

and the equality holds if and only ifY = U diag(cr(F))y^. Consequently, we have that 

^-\\X-Y\\l + co{Y) > i|k(X)-prox,(a(X))||^ 

+/i(prox^((T(X))) . 

To conclude the proof we need to show that 7 := prox^(cr(X)) has the same ordering of a, that is, 
7 is non-increasing. Suppose on the contrary that there exists i,j G N^, i < j, such that ji < 7^. 
Let 7 be the vector obtained by flipping the i-th and j-th components of 7. A direct computation 
gives 

2 Ik - 7II2 + ^(7) - 2 ""^ ~ ^"^ ~ ^^^^ ^ ~ ~ 
Since the left hand side of the above equation is positive, this leads to a contradiction. ■ 

We can compose an Ol-norm with a linear transformation B, this time between two spaces 
of matrices, obtaining yet another subclass of penalty functions of the form (11.21 ). This setting 
is relevant in the context of multi-task learning. For example [10| chooses h to be the trace or 
nuclear norm and considers a specific linear transformation which model task relatedness, namely, 
that g{X) = ||cr (X(/ — ^ee^)) ||^, where e e M*^ is the vector all of whose components are equal 
to one. 



4 Fixed Point Algorithms Based on Proximity Operators 

We now propose optimization approaches which use fixed point algorithms for nonsmooth prob- 
lems. We shall focus on problem (11.11 ) under the assumption (11.21 ). We assume that / is a strongly 
smooth convex function, that is, V/ is Lipschitz continuous with constant L, and is a nondif- 
ferentiable convex function. A typical class of such problems occurs in regularization methods 
where / corresponds to a data error term with, say, the square loss. Our approach builds on proxi- 
mal methods and uses fixed point (also known as Picard) iterations for numerically computing the 
proximity operator. 
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4.1 Computation of a Generalized Proximity Operator with a Fixed Point 
Method 



As the basic building block of our methods, we consider the optimization problem (|l.ll) in the 
special case when / is a quadratic function, that is, 

minf^^y^Qy-x^y + LoiBy) -.y eW'Y (4.1) 

where x is a given vector in R'^ and Q a positive definite d x d matrix. 

Recall the proximity operator in Definition [ZT] Under the assumption that we can explicitly or 
in a finite number of steps compute the proximity operator of uj, our aim is to develop an algorithm 
for evaluating a minimizer of problem (I4.lt . We describe the algorithm for a generic Hessian Q, 
as it can be applied in various contexts. For example, it could lead to a second-order method for 
solving (11.11) . which will be the topic of future work. In this paper, we will apply the technique to 
the task of evaluating prox^^^. 

First, we observe that the minimizer of (14.11 ) exists and is unique. Let us call this minimizer y. 
Similar to Proposition 12. 11 we have the following proposition. 

Proposition 4.1. Ifu is a convex function on M™, Q a d x d positive definite matrix and x G M'^ 
then y is the solution of problem (14.11) if and only if 

Qy (^x-d{uoB){y). (4.2) 



The subdifferential d{ijj o B) appearing in the inclusion (14.21) can be expressed with the chain 
rule (see, e.g. L6J), which gives the formula 

d{uj oB) = B^ o {duo) o B . (4.3) 
Combining equations (14.21) and (14.31) yields the fact that 

Qy Ex- B^duj{By) . (4.4) 

This inclusion along with Proposition |2. H allows us to express y in terms of the proximity operator 
oiuj. To formulate our observation we introduce the affine transformation A : R" — )■ R™ defined, 
for fixed a; G M*^, A > 0, at ^ G R" by 

Az := (/ - XBQ-^B^)z + BQ-^x 
and the operator H -.W^ ^ M™ 

H := (^/-prox^) oA. (4.5) 

Theorem 4.1. Ifu is a convex function on R™, B G R™^*^, x G R*^, A is a positive number and y 
is the minimizer o/ (l4.1l) then 

y = Q-\x-\B^v) 



if and only ifvE R™ is a fixed point of H. 
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Proof. From (14.41 ) we conclude that y is characterized by the fact that y = Q~^{x — XB^v), where 
f is a vector in the set d (f) (By). Thus it follows that v e d (f) {BQ'^{x - XB^v)). Using 
Proposition 12 . 1 1 we conclude that 

BQ-\x~ XB^v) = prox^ (Av). (4.6) 

Adding and subtracting v on the left hand side and rearranging the terms we see that f is a fixed 
point of H. 

Conversely, if t> is a fixed point of H, then equation (14.61) holds. Using again Proposition 12. II 
and the chain rule (|4.3I ). we conclude that 

XB^v ed{ujoB) {Q-\x - XB^v)) 

Proposition 14. 1 [ together with the above inclusion now implies that Q~^{x—XB^v) is the minimizer 
of (Ol). ■ 

Since the operator (/ — prox^) is nonexpansive |I71 Lemma 2.1], then 

\\H{v) - H{w)\\2 < \\Av-Aw\\2 

< \\I - XBQ-^B^\\\\v -w\\2. 

We conclude that the mapping H is nonexpansive if the spectral norm of the matrix I — XBQ^^B^ 
is not greater than one. Let us denote by Xj, j E Nm, the eigenvalues of matrix BQ^^B^ . We see 
that H is nonexpansive provided that |1 — AAj| < 1, that is if < A < 2/Amax, where Amax is the 
spectral norm of BQ^^B^ . In this case we can appeal to Opial's Theorem |2.1| to find a fixed point 
of if. 

Note that if, for every j E N^, A^ > 0, that is, the matrix BQ^^B^ is invertible, then the 
mapping H is strictly nonexpansive when < A < 2/ Amax- In this case, the Picard iterates of H 
converge to the unique fixed point of H, without the need to use Opial's Theorem. 

We end this section by noting that, when Q = I, the above theorem provides an algorithm for 
computing the proximity operator of to o B. 

Corollary 4.1. Let u be a convex function on M™, B E W^^'^, x E W^, X a positive number and 
define the mapping v {I — prox^)((/ — XBB^)v + Bx). Then 

prox^oi?(a;) = x- XB^v 

if and only if v is a fixed point ofH. 

Thus, a fixed point iterative scheme like the above one can be used as part of any proximal 
method when the regularizer has the form (II. 2t . 

4.2 Accelerated First-Order Methods 

Corollary 14. 1! motivates a general proximal numerical approach to solving problem (|l.ll) (Algo- 
rithm [T). Recall that L is the Lipschitz constant of V/. The idea behind proximal methods - see 
[|71 m |28l [33l [34l and references therein - is to update the current estimate of the solution Xt using 
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Algorithm 1 Proximal & fixed point algorithm. 

xi, ai ^ 
fort=l,2,. .. do 

Compute xt+i ^ prox^^^ [at - ^V/(ai)) 

by the Picard-Opial process 
Update at+i as a function of Xt+i^Xt, . . . 
end for 



the proximity operator. This is equivalent to replacing / with its linear approximation around a 
point at specific to iteration t. The point at may depend on the current and previous estimates of 
the solution xt, Xt-i, . . . , the simplest and most common update rule being at = Xt- 

In particular, in this paper we focus on combining Picard iterations with accelerated first-order 
methods proposed by Nesterov [|27ll28l . These methods use an a update of a specific type, which 
requires two levels of memory of x. Such a scheme has the property of a quadratic decay in 
terms of the iteration count, that is, the distance of the objective from the minimal value is O (t^^) 
after T iterations. This rate of convergence is optimal for a first order method in the sense of the 
algorithmic model of [|25l . 

It is important to note that other methods may achieve faster rates, at least under certain con- 
ditions. For example, interior point methods Il29ll or iterated reweighted least squares Il8l [3T1 [B 
have been applied successfully to nonsmooth convex problems. However, the former require the 
Hessian and typically have high cost per iteration. The latter require solving linear systems at each 
iteration. Accelerated methods, on the other hand, have a lower cost per iteration and scale to larger 
problem sizes. Moreover, in applications where some type of thresholding operator is involved - 
for example, the Lasso (|2.3I) - the zeros in the solution are exact, which may be desirable. 

Since their introduction, accelerated methods have quickly become popular in various areas of 
applications, including machine leaming, see, for example, [^4l[T5l[T7l[T3l and references therein. 
However, their applicability has been restricted by the fact that they require exact computation 
of the proximity operator. Only then is the quadratic convergence rate known to hold, and thus 
methods using numerical computation of the proximity operator are not guaranteed to exhibit this 
rate. What we show here, is how to further extend the scope of accelerated methods and that, 
empirically at least, these new methods outperform current O methods while matching the 
performance of optimal O(^) methods. 

In Algorithm[2]we describe a version of accelerated methods influenced by [[33l[34l . Nesterov's 
insight was that an appropriate update of at which uses two levels of memory achieves the O (^) 

rate. Specifically, the optimal update is at+i ^ Xt+i + 6t+i (^j-^ — (a^t+i — Xt) where the 
sequence 9t is defined by = 1 and the recursive equation 

1 - gf+i ^ 1 

We have adapted [[33l Algorithm 2] (equivalent to FISTA flU) by computing the proximity operator 
of joB using the Picard-Opial process described in Section 1431 We rephrased the algorithm using 
the sequence pt := 1 — 9t + ^1 — 9t = I — dt+ for numerical stability. At each iteration, the 
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map At is defined by 

Atz := (^I - jBB^^ z - lfi(V/(at) - La^) 

and Ut as in (14.51) . By Theorem 14. 1[ the fixed point process combined with the x update are 
equivalent to x^+i ^ proxii^^ [at - \y f[pLt)). 



Algorithm 2 Accelerated & fixed point algorithm. 

Xi, «! ^ 

fort=l,2,. .. do 

Compute a fixed point v of Ht by Picard-Opial 

xt+i ^ at - iV/(at) - \B^v 
at+i ^ pt+ia;t+i - (pt+i - l)xt 
end for 



5 Numerical Simulations 

We have evaluated the efficiency of our method with simulations on different nonsmooth learning 
problems. One important aim of the experiments is to demonstrate improvement over a state 
of the art suite of methods (SLEP) [il6il in the cases when the proximity operator is not exactly 
computable. 

An example of such cases which we considered in Section ISTI is the Group Lasso with over- 
lapping groups. An algorithm for computation of the proximity operator in a finite number of 
steps is known only in the special case of hierarchy-induced groups [13|. In other cases such 
as groups induced by directed acyclic graphs [38] or more complicated sets of groups, the best 
known theoretical rate for a first-order method is O [j^) . We demonstrate that such a method can 
be improved. 

Moreover, in Section 15^ we report efficient convergence in the case of a composite £i penalty 
used for graph prediction [jTT]. In this case, matrix B is the incidence matrix of a graph and the 
penalty is Yl W^i ~ where E is the set of edges. Most work we are aware of for the 

composite ii penalty applies to the special cases of total variation [|3l or Fused lasso [jlQ], in which 
B has a simple structure. A recent method for the general case (5) which builds on Nesterov's 
O (|;) smoothing technique [27] does not have publicly available software yet. 

Another advantage of Algorithm |2] which we highlight is the high efficiency of Picard itera- 
tions for computing different proximity operators. This requires only a small number of iterations 
regardless of the size of the problem. We also report a roughly linear scalability with respect to the 
dimensionality of the problem, which shows that our methodology can be applied to large scale 
problems. 

In the following simulations, we have chosen the parameter from Opial's theorem k = 0.2. The 
parameter A was set equal to t — — , where Amax and Amin are the largest and smallest eigenval- 
ues, respectively, of j^BB^ . We have focused exclusively on the case of the square loss and we 
have computed L using singular value decomposition (if this were not possible, a Frobenius esti- 
mate could be used). Finally, the implementation ran on a 16GB memory dual core Intel machine. 



10 




Figure 1: Objective function vs. iteration for the overlapping groups data (d = 3500). Note that 
Picard-Nesterov terminates earlier within e. 



Picard 4 
difference 




1 2 3 4 5 6 7 

iteration count 



Figure 2: £2 difference of successive Picard iterates vs. Picard iteration for the overlapping groups 
data (d = 3500). 



The Matlab code is available at http : //ttic . uchicago . edu/~argyriou/code/ 
index . html. 

5.1 Overlapping Groups 

In the first simulation we considered a synthetic data set which involves a fairly simple group topol- 
ogy which, however, cannot be embedded as a hierarchy. We generated data A G R*^"^, with s = 
[0.7d] from a uniform distribution and normalized the matrix. The target vector x* was also gen- 
erated randomly so that only 21 of its components are nonzero. The groups used in the regularizer 
WGL-seeeq. dlB-are: {1, 5}, {5, 9}, {9, 13}, {13, 17}, {17, 21}, {4, 22, 30}, 
{8, 31, ...,40}, {12, 41,. ..,50}, {16, 51,. ..,60}, {20, 61, ...,70}, {71,..., 80},..., {d- 9, ...,rf}. 

That is, the first 5 groups form a chain, the next 5 groups have a common element with one 
of the first groups and the rest have no overlaps. An issue with overlapping group norms is the 
coefficients assigned to each group (see [12J for a discussion). We chose to use a coefficient of 1 
for every group and compensate by normalizing each component of x* according to the number 
of groups in which it appears (this of course can only be done in a synthetic setting like this). 
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Figure 3: Average measures vs. dimensionality for the overlapping groups data. Top: number 
of iterations. Bottom: CPU time. Note that this time can be reduced to a fraction with a C 
implementation. 



0.4 



0.35 




Figure 4: Objective function vs. iteration for the hierarchical overlapping groups. 



The outputs were then generated as y = Ax* + noise with zero mean Gaussian noise of standard 
deviation 0.001. 

We used a regularization parameter equal to 10^^. We ran the algorithm for (i = 1000, 1100, . . . , 
4000, with 10 random data sets for each value of d, and compared its efficiency with SLEP. The 
solutions found recover the correct pattern without exact zeros due to the regularization. Figure [H 
shows the number of iterations T in Algorithm |2] needed for convergence in objective value within 
e = 1Q~^. SLEP was run until the same objective value was reached. We conclude that we out- 
perform SLEP's O (|;) method. Figured demonstrates the efficiency of the inner computation of 
the proximity map at one iteration t of the algorithm. Just a few Picard iterations are required for 
convergence. The plots for different t are indistinguishable. 

Similar conclusions can be drawn from the plots in Figure |3] where average counts of iterations 
and CPU time are shown for each value of d. We see that the number of iterations depends almost 
linearly on dimensionality and that SLEP requires an order of magnitude more iterations - which 
grow at a higher rate. Note also that the cost per iteration is comparable between the two methods. 
We also observed that computation of the proximity map is insensitive to the size of the problem 
(it only requires 7 — 8 iterations for all d). Finally, we report that CPU time grows linearly with 
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dimensionality. To remove various overheads this estimate was obtained from Matlab's profiling 
statistics for the low-level functions called. A comparison with SLEP is meaningless since the 
latter is a C implementation. 

Besides outperforming the O(^) method, we also show that the Picard-Nesterov approach 
matches SLEP's O(^) method for the tree structured Group Lasso [1T81 . To this end, we have 
imitated an experiment from [|T3l Sec. 4.1] using the Berkeley segmentation data se^ We have 
extracted a random dictionary of 71 16 x 16 patches from these images, which we have placed 
on a balanced tree with branching factors 10, 2, 2 (top to bottom). Here the groups correspond 
to all subtrees of this tree. We have then learned the decomposition of new test patches in the 
dictionary basis by Group Lasso regularization (I3.lt . As Figured shows, our method and SLEP 
are practically indistinguishable. 

5.2 Graph Prediction 

The second simulation is on the graph prediction of [11] in the limit of p = 1 (composite ii). We 
constructed a synthetic graph of d vertices, d = 100, 120, . . . , 360 with two clusters of equal size. 
The edges in each cluster were selected from a uniform draw with probability | and we explicitly 
connected d/25 pairs of vertices between the clusters. The labeled data y were the cluster labels of 
s = 10 randomly drawn vertices. Note that the effective dimensionality of this problem is 0{d'^). 
At the time of the paper's writing there is not an accelerated method with software available online 
which handles a generic graph. 

First, we observed that the solution found recovered perfectly the clustering. Next, we studied 
the decay of the objective function for different problem sizes (Figure |5]). We noted a striking 
difference from the case of overlapping groups in that convergence now is not monotone^ The na- 
ture of decay also differs from graph to graph, with some cases making fast progress very close 
to the optimal value but long before eventual convergence. This observation suggests future mod- 
ifications of the algorithm which can accelerate convergence by a factor. As an indication, the 
distance from the optimum was just 2.2 ■ 10"*^, 5.4 • 10"^, 1.5 ■ 10"^ at iteration 611, 821, 418 for 
d = 100, 120, 140, respectively. We verified in this data as well, that Picard iterations converge 
very fast (Figure |6]). Finally in Table 15^ we report average iteration numbers and running times. 
These prove the feasibility of solving problems with large matrices B even using a "quick and 
dirty" Matlab implementation. 

In addition to a random incidence matrix, one may consider the special case of Fused Lasso or 
Total Variation in which B has the simple form (|3.2I) . It has been shown how to achieve the optimal 
O (^) rate for this problem in [3]. We applied Fused Lasso (without Lasso regularization) to the 
same clustering data as before and compared SLEP with the Picard-Nesterov approach. As Figure 
|7]shows, the two trajectories are identical. This provides even more evidence in favor of optimality 
of our method. 

'http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/ 

^ There is no monotonicity guarantee for Nesterov's accelerated method. 
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Figure 5: Objective function vs. iteration for the graph data. Note the progress in the early stages 
in some cases. 




5 10 15 20 

iteration count 

Figure 6: £2 difference of successive Picard iterates vs. Picard iteration for the graph data (d = 
100). 



6 Conclusion 

We presented an efficient first order method for solving a class of nonsmooth optimization prob- 
lems, whose objective function is given by the sum of a smooth term and a nonsmooth term, which 
is obtained by linear function composition. The prototypical example covered by this setting in a 
linear regression regularization method, in which the smooth term is an error term and the nons- 
mooth term is a regularizer which favors certain desired parameter vectors. An important feature 
of our approach is that it can deal with richer classes of regularizers than current approaches and at 
the same time is at least as computationally efficient as specific existing approaches for structured 
sparsity. In particular our numerical simulations demonstrate that the proposed method matches 
optimal 0(7^) methods on specific problems (Fused Lasso and tree structured Group Lasso) while 
improving over available 0{j;) methods for the overlapping Group Lasso. In addition, it can han- 
dle generic linear composite regularization problems, for many of which accelerated methods do 
not yet exist. In the future, we wish to study theoretically whether the rate of convergence is 
O (^) , as suggested by our numerical simulations. There is also much room for further accelera- 
tion of the method in the more challenging cases by using practical heuristics. At the same time, it 
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d 


no. iterations 


CPU time (sees.) 


100 


2599.6 


21.461 


120 


3680.0 


54.745 


140 


4351.8 


118.61 


160 


3124.8 


164.21 


180 


2845.8 


241.69 


200 


3476.2 


359.75 


220 


4490.0 


911.67 


240 


4490.0 


911.67 


260 


3639.2 


930.8 



Table 1: Graph data. Note that the effective dhO{d'^). CPU time can be reduced to a fraction with 
a C implementation. 





\ 

\ 






SLEP 

Picard - Nestercv 


3 










objective 

2 
1 





















1000 2000 3000 4000 5000 

iteration count 



Figure 7: Objective function vs. iteration for the Fused Lasso {d = 100). The two trajectories are 
identical. 



will be valuable to study further applications of our method. These could include machine learning 
problems ranging from multi-task learning, to multiple kernel learning and to dictionary learning, 
all of which can be formulated as linearly composite regularization problems. 
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7 Appendix 

In this appendix, we collect some basic facts about fixed point theory which are useful for our 
study. For more information on the material presented here, we refer the reader to [|37l . 

Let X be a closed subset of M"^. A mapping (/? : X — )■ X is called strictly non-expansive (or 
contractive) if there exists A e [0,1) such that, for every x,y G X, 

ll^(x) -(p{y)\\ < X\\x-y\\. 

If the above inequality holds for A = 1, the mapping is called nonexpansive. We say that x is a 
fixed point of (y9 if x = (/^(x). The Picard iterates x", n E N starting at xq G X are defined by the 
recursive equation x" = (f{x"'~^). 

It is a well-knwon fact that, if (p is strictly nonexpansive then ip has a unique fixed point x and 
lim„_s.oo a;" = x. However, this result fails if (p is nonexpansive. For example, the map (p{x) = x+1 
does not have a fixed point. On the other hand, the identity map has infinitely many fixed points. 

Definition 7.1. Let X be a closed subset ofM.'^. A map ip : X ^ X is called asymptotically regular 
provided that Xmin^oo \\x'^^^ — x"|| = 0. 

Proposition 7.1. Let X be a closed subset ofW^ and ip : X ^ X such that 
L if is nonexpansive; 

2. ip has at least one fixed point; 

3. ip is asymptotically regular 

Then the sequence {x" : n G N} converges to a fixed point of ip. 

Proof. We divide the proof in three steps. 

Step 1: The Picard iterates are bounded. Indeed, let x be a fixed point of ip. We have that 

||a;"+^ -x|| = \\ip{x^) - ip{x)\\ < ||x" -x|| < • ■ ■ < ||x° - x||. 

Step 2: Let {x"* : G N} be a convergent subsequence, whose limit we denote by y. We will 
show that y is a fixed point of ip. Since ip is continuous, we have that \im.k^ao{'^^^ — 'p{x^'')) = 
y — ^p{y), and since ip is asymptotically regular y — ip{y) = 0. 

Step 3: The whole sequence converges. Indeed, following the same reasoning in the proof 
of Step 1, we conclude that the sequence {||x" — y\\ : n E N} is non-increasing. Let a = 
lim„_5.oo — y\\- Since limfc_^oo — vW = 0, we conclude that a = and, so, lim„^oo a;" = 

y- ■ 
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We note that in general, without the asymptotically regularity assumption, the Picard iterates 
do not converge. For example, consider (f{x) = —x. Its only fixed point is x = 0; if we start from 
a;° 7^ the Picard iterates will oscillate. Moreover, if Lp(x) = x + I, which is nonexpansive, the 
Picard iterates diverge. 

We now discuss the main tool which we use to find a fixed point of a nonexpansive mapping Lp. 

Theorem 7.1. ( Opial n-average theorem /IJO'/j Let X be a closed convex subset ofM.'^, (f : X ^ X 
a nonexpansive mapping, which has at least one fixed point and let (f,^ := kI + {1 — K)(p. Then, 
for every k G (0, 1), the Picard iterates ofip^ converge to a fixed point ofip. 

We prepare for the proof with two useful lemmas. 

Lemma 7.1. If k g (0, 1), u,w e W^, \\u\\ < \\w\\, then 

— k)^W — U\ < 11^11 — + (1 — Kj-Ull 



Proof. The assertion follows from £2 strong convexity, 

k(1 — — + \kW + (1 — 

= K\wf + (1 - i^)\uf < WwW^. 

■ 

Lemma 7.2. If {u^ : n G N} and {w" : n G N} are sequences in M'^ such that lim„_j.oo ll""^"!! = 
Wu^W < \\w"\\ anJUm„_>oo \\i^w"' + (1 — k,)u'^\\ = 1, then lim^^oo'it'" — = 0. 

Proof. Apply Lemma ItTI to note that 

k{1 - - M"f < ||w;"f - + (1 - 

By hypothesis the right hand side tends to zero as n tends to infinity and the result follows. ■ 

Proof of Theorem I7.1i Let {x" : n G N} be the iterates of ip^. We will show that ip^ is asymp- 
totically regular. The result will then follow by Proposition |7T| and the fact that and have the 
same set of fixed points. 

Let x""*"^ = ftx" + (1 — Note that, if u is fixed point of ip,^, then 

- u\\ < - m|| < ■ ■ • < - u\\ . 

Let d := lim„_s.oo — If ci = the result is proved. We will show that if d > we 
contradict the hypotheses of the theorem. For every n G N, we define w"^ = d (x" — u) and 
= d (v5(x") — u). Note that the sequences {w" : n G N} and {u" : G N} satisfy 
the hypotheses of Lemma W72\ Thus, we have that lim„_j.oo(x" — v^(x")) = 0. Consequently 
^n+i _ = (1 _ K){ip{x^) — x") — )■ 0, showing that {x" : n G N} is asymptotically regular. ■ 
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